Project Report - Group 9

Analysis of Factors Influencing the Prize Money Earned by Tennis Players

Andrew Sun 46657300 | Eric Liu 55000681 | Ghazal (Giselle) Mirfallah 80760747 | Keshav Gopinath 61086260 | Miranda Yang 80421225 | Ziya Wang 12935136

Introduction

1. Background information

Tennis is one of the most popular racquet-based sports, played worldwide by countless individuals - from recreational players to elite athletes. Although the sport is loved and played by many, the professional men's tennis scene has been dominated by three top players for the past decade (Wertheim, 2019). Often referred to as 'the Big Three', these athletes consistently win the largest tournaments around the world, making it extremely difficult for other players to make a living off the sport (Wertheim, 2019).

2. Predictive question

With this in mind, our project focuses on one of the most controversial factors in men's tennis: players’ prize money. The wealth disparity in professional tennis is more drastic than in other major sports, "with the best players traveling with entourages aboard private jets, and a good chunk of the field trying to break through without going broke" (Gay, 2019). To further examine this anomaly, we will study how prize money is affected by factors such as a player’s age, current ranking, and seasons played.

3. Data set

With data collected by “Ultimate Tennis Statistics” from the year 2019, this Player Stats dataset (player_stats.csv) provides information on top-ranking tennis players from all around the world, with columns including age, ranking, nationality, and many more. To examine the relationship between prize money and various factors, we selected prize money, age, current ranking, and seasons played as our main variables.

Methods & Results

In [1]:
install.packages("plotly")
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
In [2]:
library(tidyverse)
library(GGally)
library(caret)
library(plotly)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
 ggplot2 3.2.0      purrr   0.3.2
 tibble  2.1.3      dplyr   0.8.3
 tidyr   0.8.3      stringr 1.4.0
 readr   1.3.1      forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Attaching package: ‘GGally’

The following object is masked from ‘package:dplyr’:

    nasa

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift


Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

1. Preliminary and Exploratory Data Analysis

1.1 Original source - Data loading

In [3]:
# Load in the original data set
url <- "https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS"
download.file(url, destfile = "player_stats.csv")

player_stats <- read_csv("player_stats.csv")
glimpse(player_stats)
head(player_stats)
Warning message:
“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  .default = col_character(),
  X1 = col_double(),
  `Turned Pro` = col_double(),
  Seasons = col_double(),
  Titles = col_double(),
  `Best Season` = col_double(),
  Retired = col_double(),
  Masters = col_double(),
  `Grand Slams` = col_double(),
  `Davis Cups` = col_double(),
  `Team Cups` = col_double(),
  Olympics = col_double(),
  `Weeks at No. 1` = col_double(),
  `Tour Finals` = col_double()
)
See spec(...) for full column specifications.
Observations: 500
Variables: 38
$ X1                 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
$ Age                <chr> "26 (25-04-1993)", "18 (22-12-2001)", "32 (03-11-1…
$ Country            <chr> "Brazil", "United Kingdom", "Slovakia", "Korea, Re…
$ Plays              <chr> "Right-handed", "Left-handed", "Right-handed", "Ri…
$ Wikipedia          <chr> "Wikipedia", "Wikipedia", "Wikipedia", "Wikipedia"…
$ `Current Rank`     <chr> "378 (97)", "326 (119)", "178 (280)", "236 (199)",…
$ `Best Rank`        <chr> "363 (04-11-2019)", "316 (14-10-2019)", "44 (14-01…
$ Name               <chr> "Oscar Jose Gutierrez", "Jack Draper", "Lukas Lack…
$ Backhand           <chr> NA, "Two-handed", "Two-handed", "Two-handed", "Two…
$ `Prize Money`      <chr> NA, "$59,040", "US$3,261,567", "$374,093", "US$6,0…
$ Height             <chr> NA, NA, "185 cm", NA, "193 cm", NA, NA, NA, NA, NA…
$ `Favorite Surface` <chr> NA, NA, "Fast (H, G) 40%", NA, "Fast (H, G) 36%", …
$ `Turned Pro`       <dbl> NA, NA, 2005, NA, 2008, 2015, 2010, NA, NA, NA, 20…
$ Seasons            <dbl> NA, NA, 14, 2, 11, 5, 1, 1, 5, 3, 3, 5, 8, NA, 5, …
$ Active             <chr> NA, NA, "Yes", "Yes", "Yes", "Yes", NA, "Yes", "Ye…
$ `Current Elo Rank` <chr> NA, NA, "144 (1764)", NA, "100 (1826)", "33 (1983)…
$ `Best Elo Rank`    <chr> NA, NA, "60 (06-02-2012)", NA, "21 (23-03-2015)", …
$ `Peak Elo Rating`  <chr> NA, NA, "1886 (06-02-2012)", NA, "2037 (01-02-2016…
$ `Last Appearance`  <chr> NA, NA, "18-03-2019", "19-08-2019", "14-10-2019", …
$ Titles             <dbl> NA, NA, NA, NA, 4, 1, NA, NA, NA, NA, NA, 7, NA, N…
$ `GOAT Rank`        <chr> NA, NA, NA, NA, "264 (6)", "489 (1)", NA, NA, NA, …
$ `Best Season`      <dbl> NA, NA, NA, NA, 2015, 2019, NA, NA, NA, NA, 2019, …
$ Retired            <dbl> NA, NA, NA, NA, NA, NA, 2017, NA, NA, NA, NA, NA, …
$ Masters            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, NA,…
$ Birthplace         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Residence          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Weight             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Coach              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Facebook           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Twitter            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Nicknames          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Grand Slams`      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Davis Cups`       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Web Site`         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Team Cups`        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Olympics           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Weeks at No. 1`   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Tour Finals`      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
A tibble: 6 × 38 (selected columns shown; remaining columns are NA for these rows)
X1  Age              Country             Plays         Current Rank  Best Rank         Name                  Backhand    Prize Money
0   26 (25-04-1993)  Brazil              Right-handed  378 (97)      363 (04-11-2019)  Oscar Jose Gutierrez  NA          NA
1   18 (22-12-2001)  United Kingdom      Left-handed   326 (119)     316 (14-10-2019)  Jack Draper           Two-handed  $59,040
2   32 (03-11-1987)  Slovakia            Right-handed  178 (280)     44 (14-01-2013)   Lukas Lacko           Two-handed  US$3,261,567
3   21 (29-05-1998)  Korea, Republic of  Right-handed  236 (199)     130 (10-04-2017)  Duck Hee Lee          Two-handed  $374,093
4   27 (21-10-1992)  Australia           Right-handed  183 (273)     17 (11-01-2016)   Bernard Tomic         Two-handed  US$6,091,971
5   22 (11-02-1997)  Poland              Right-handed  31 (1398)     31 (20-01-2020)   Hubert Hurkacz        Two-handed  $1,517,157

1.2 Data cleaning and wrangling

After loading the Tennis Statistics data set, we see a significant amount of information that needs to be cleaned and wrangled before proper analysis.

The first step involved cleaning up the column names with make.names() for easier identification and removing unnecessary variable columns. We cut out the columns that were missing most values or were irrelevant to our final analysis. This left us with the columns Age, Seasons, Current.Rank, Best.Rank, and Prize.Money.

The second part involved cleaning up the values in the columns themselves. We wanted these columns to include only numerical values, so we removed unnecessary spaces and extra values. We also used the as.numeric() function to convert each column to numeric.

The final column, Prize.Money, was the most challenging one to clean. We noticed many inconsistencies in formatting such as the occasional inclusion of either $ or US$ symbols as well as unnecessary commas and words. To remove everything other than the numerical amounts, we used the mutate() and gsub() functions to cut out specific characters and then used separate() to split the values with delimiters, removing the extra column with the irrelevant strings.

Finally, we added a Prize.Money.in.million column for easy readability in the distribution graph before creating the final data frame for analysis and processing.
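As an illustration of the gsub() steps described above, here is a minimal base-R sketch on made-up prize strings (the actual cleaning of the full data frame is in the cell below):

```r
# Strip "$" and "US" prefixes, remove thousands commas, trim whitespace,
# then coerce the remaining digits to numeric (hypothetical example values).
raw <- c("US$3,261,567", "$59,040", " $374,093 ")
cleaned <- gsub("\\$", "", raw)               # drop dollar signs
cleaned <- gsub("US", "", cleaned)            # drop the "US" prefix
cleaned <- gsub(",", "", cleaned)             # drop thousands separators
cleaned <- gsub("^\\s+|\\s+$", "", cleaned)   # trim surrounding spaces
as.numeric(cleaned)  # 3261567 59040 374093
```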

In [4]:
# Cleaning Un-needed Columns
colnames(player_stats) <- make.names(colnames(player_stats))
player_stats <- select(player_stats, 
                       -c(X1, Name, Current.Elo.Rank:Tour.Finals, Plays, Wikipedia, Backhand, Favorite.Surface,Active, Height, Turned.Pro))
    
# Cleaning Prize.Money Column
player_stats <- player_stats %>%
    mutate(Prize.Money = gsub("\\$", "", Prize.Money)) %>%
    mutate(Prize.Money = gsub("US", "", Prize.Money)) %>%
    mutate(Prize.Money = gsub("\\,", "", Prize.Money)) %>%
    mutate(Prize.Money = gsub("^\\s+|\\s+$", "", Prize.Money)) %>%
    separate(col = Prize.Money , into = c("Prize.Money", "extra"), sep = " ") %>%
    select(-extra) %>%
    mutate(Prize.Money = as.numeric(Prize.Money)) %>% 
    mutate(Prize.Money.in.million = Prize.Money /1000000)

# Cleaning Age Column
player_stats <- player_stats %>%
    separate(col = Age , into = c("Age", "extra"), sep = " ") %>%
    select(-extra) %>%
    mutate(Age = as.numeric(Age))
    
# Cleaning Current.Rank Column
player_stats <- player_stats %>%
    separate(col = Current.Rank , into = c("Current.Rank", "extra"), sep = " ") %>%
    select(-extra) %>%
    mutate(Current.Rank = as.numeric(Current.Rank))

# Cleaning Best.Rank Column
player_stats <- player_stats %>%
    separate(col = Best.Rank , into = c("Best.Rank", "extra"), sep = " ") %>%
    select(-extra) %>%
    mutate(Best.Rank = as.numeric(Best.Rank))

# Removes NA rows and selects important columns and 
player_stats_selected  <-  player_stats  %>% 
    select(Age, Seasons, Current.Rank, Best.Rank, Prize.Money, Prize.Money.in.million)  %>% 
    na.omit()

head(player_stats_selected)
Warning message:
“Expected 2 pieces. Additional pieces discarded in 20 rows [16, 96, 128, 142, 208, 212, 267, 274, 337, 363, 364, 381, 394, 414, 423, 435, 436, 461, 468, 497].”
Warning message:
“Expected 2 pieces. Missing pieces filled with `NA` in 397 rows [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 20, 21, 22, 23, ...].”
A tibble: 6 × 6
Age  Seasons  Current.Rank  Best.Rank  Prize.Money  Prize.Money.in.million
 32       14           178         44      3261567                3.261567
 21        2           236        130       374093                0.374093
 27       11           183         17      6091971                6.091971
 22        5            31         31      1517157                1.517157
 28        1           307        213       278709                0.278709
 21        1           232        229        59123                0.059123

1.3 Summary statistics

In [5]:
# Summary statistics

player_stats_summary <- summarize(player_stats_selected,
                                  min_Prize = min(Prize.Money),
                                  max_Prize = max(Prize.Money), 
                                  mean_Prize = mean(Prize.Money),
                                  median_Prize = median(Prize.Money),
                                  IQR = IQR(Prize.Money),
                                  sd = sd(Prize.Money))

player_stats_summary
A tibble: 1 × 6
min_Prize  max_Prize  mean_Prize  median_Prize      IQR        sd
      145  139144944     3884855        661836  2822774  12795834

The summary statistics above show that the values of Prize.Money are very widely distributed (spread out from $145 to $139,144,944) with an interquartile range of $2,822,774 and a standard deviation of $12,795,834.

In order to determine how prize money is affected by various predictors, we need to look deeper into its distribution; to do this, we visualized the distribution of Prize.Money.in.million using a density plot and a histogram.

In [6]:
options(repr.plot.width = 6, repr.plot.height = 4)

# Density plot for prize value distribution
prize_distribution_plot <- ggplot(player_stats_selected, aes(x = Prize.Money.in.million)) + 
     geom_density(fill = "lightblue",
                   color = 'steelblue') +
     labs( x = "Prize Money (in millions)", y = "Distribution density") +
     ggtitle("Figure 1.3.1  Prize Money Distribution (Density)")

prize_distribution_plot

# Histogram for prize value distribution
prize_hist <- player_stats_selected %>% 
    ggplot(aes(x = Prize.Money.in.million)) + 
       geom_histogram(bins = 50,
                      position = "identity",
                      fill = "lightblue",
                      color = 'steelblue') +
       labs(x = "Prize Money (in millions)", y = "Count of players") +
       ggtitle("Figure 1.3.2  Prize Money Distribution")
prize_hist

As shown in the plots above, the distribution of the prize money is extremely skewed due to extreme outliers, further emphasizing the drastic wealth disparity between the top few players and everyone else. Since the prize money is the centre of focus for our project, this extremely unbalanced distribution is a major obstacle. Therefore, it is important to keep this in mind as we carry out the rest of our analysis, given that these outliers could influence our results.
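A quick numeric illustration of this skew, using hypothetical prize values in millions (not our data): a couple of extreme earners pull the mean far above the median.

```r
# Eight modest earners plus two outliers (values in millions, hypothetical)
prizes <- c(rep(0.3, 8), 6, 139)
mean(prizes)    # 14.74 - dragged upward by the outliers
median(prizes)  # 0.3  - unaffected by them
```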

1.4 Visualization for exploratory analysis

At this stage, we are considering a multivariate prediction model with 4 predictors while also deciding between classification and regression. To narrow down our list of options, we plotted all the potential predictor variables in one ggpairs plot to assess the linear relationships between Prize.Money and each of the four predictors: Age, Seasons, Current.Rank, and Best.Rank.

In [7]:
options(repr.plot.width = 9, repr.plot.height = 6)

ggpair_plot  <- player_stats_selected %>%  
    select(-Prize.Money) %>% 
    ggpairs() +
    ggtitle("Figure 1.4  Correlation with All Predictors")
                
ggpair_plot

After analyzing Figure 1.4, we noticed the four selected predictors have much lower correlations with Prize.Money than we expected. In response, we made the following decisions for building the prediction model:

  1. Remove the ‘Best.Rank’ predictor:

    • Seasons has the strongest correlation with Prize.Money, and even that is only moderate (0.447).
    • Age and Current.Rank have weak correlations of 0.31 and -0.307 respectively.
    • Best.Rank has the weakest correlation (-0.299) with Prize.Money; for this reason, we decided to exclude Best.Rank from the predictors.
  2. Use classification instead of regression:

    • As mentioned, none of the variables has a strong linear correlation with Prize.Money, so a linear regression model would make inaccurate predictions. Given the low correlation values, we also decided against k-nearest neighbour (k-nn) regression: classification gives a more intuitive prediction (i.e. a range) rather than an exact quantitative value, and predicting a general range in which a value could lie is more reliable than predicting a particular number.

Therefore, we decided to categorize the numeric values in Prize.Money into separate tiers to build a k-nn classification model.
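The tiering idea can be sketched in base R with cut(), using hypothetical break points rather than the actual quantiles computed in the next section:

```r
# Bin numeric prize values (hypothetical) into ordered, labelled tiers
x <- c(50, 150, 900, 3000, 10000)
breaks <- c(-Inf, 100, 1000, 5000, Inf)
tiers <- cut(x, breaks, labels = c("very low", "low", "medium", "high"))
as.character(tiers)  # "very low" "low" "low" "medium" "high"
```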

2. Classification analysis

2.1 Classification preparation - Categorize numeric values to class labels

We decided to divide the prize money using the quantile() function.

How we divided the distribution initially:

Initially, we tried to divide the prize values into five brackets, cutting at the 1/6, 2/6, 4/6, and 5/6 marks. This gives the following percentile brackets:

  • 0% - 16.5% (really low)
  • 16.5% - 33% (low)
  • 33% - 67% (medium)
  • 67% - 83.5% (high)
  • 83.5% - 100% (really high)
In [8]:
# Retrieving prize money information for the quantile function
prize_money_percentiles = player_stats_selected %>% 
    select(Prize.Money)  %>% 
    na.omit()  %>% 
    pull()

# Making a percentile with respect to 16.5% 33% 67% and 83.5%
quantile(prize_money_percentiles, prob=c(.165,.33,.67,.835))
    16.5%        33%         67%        83.5%
 117744.6   302981.7  1591534.75  5258947.875
In [9]:
# Creates categorization using these percentiles
player_stats_bad_classification <- player_stats_selected  %>% 
     mutate(prize.money.classified =   ifelse(Prize.Money < 117744.6, "Really Low Amount (Less than $117,744.60)",
                                            ifelse(Prize.Money <302981.7, "Low Amount, (Less than $302,981.70)",
                                            ifelse(Prize.Money < 1591534.75, "Medium Amount",
                                            ifelse(Prize.Money < 5258947.875, "High Amount, (More than $1,591,534.75)", 
                                                   "Very High Amount (More than $5,258,947.88)")))))

# Refactors categorization in correct order
player_stats_bad_classification$prize.money.classified = factor(player_stats_bad_classification$prize.money.classified, levels = 
                                                      c("Really Low Amount (Less than $117,744.60)",
                                                        "Low Amount, (Less than $302,981.70)", 
                                                        "Medium Amount", 
                                                        "High Amount, (More than $1,591,534.75)",
                                                        "Very High Amount (More than $5,258,947.88)"))
# Plot unfavourable percentiles for comparison
plot_bad = player_stats_bad_classification  %>% 
    ggplot(aes(x = prize.money.classified))  +
    geom_bar() + 
    ggtitle("Figure 2.1.1 Unfavourable categorization for comparison") +
    theme(axis.text.x = element_text(angle = 70, hjust = 1)) + 
    labs(x = "Prize Money Classes", y = "Number of players") 

# Heads our dataframe
head(player_stats_bad_classification)
A tibble: 6 × 7
Age  Seasons  Current.Rank  Best.Rank  Prize.Money  Prize.Money.in.million  prize.money.classified
 32       14           178         44      3261567                3.261567  High Amount, (More than $1,591,534.75)
 21        2           236        130       374093                0.374093  Medium Amount
 27       11           183         17      6091971                6.091971  Very High Amount (More than $5,258,947.88)
 22        5            31         31      1517157                1.517157  Medium Amount
 28        1           307        213       278709                0.278709  Low Amount, (Less than $302,981.70)
 21        1           232        229        59123                0.059123  Really Low Amount (Less than $117,744.60)

However, Figure 2.1.1 below shows that our initial split creates an uneven distribution of players across the prize money classes. Based on what we learned from the original distribution and what we observed here, we decided to refine the percentiles.

In [10]:
options(repr.plot.width = 6, repr.plot.height = 5)
plot_bad

With the previous split, the "low" and "high" earning brackets covered a wide range of earnings relative to the range of the "medium" bracket. In particular, the "high" earning bracket (67% - 100%) covered an abnormally wide range of amounts, as it included the extreme outliers as well as moderate values. Thus, we decided to split the data into finer classification brackets:

  • 0% - 10% (very low)
  • 10% - 33% (low)
  • 33% - 67% (medium)
  • 67% - 90% (high)
  • 90% - 100% (very high)
In [11]:
# Making a percentile with respect to 10% 33% 67% and 90%
quantile(prize_money_percentiles, prob=c(.1,.33,.67,.9))
      10%        33%         67%        90%
    65411   302981.7  1591534.75  8548203.5
In [12]:
# Re-categorize Prize.Money
player_stats_classes <- player_stats_selected %>% 
      mutate(prize.money.classified =   ifelse(Prize.Money < 65411, "Really Low Amount (Less than $65411.00)",
                                            ifelse(Prize.Money <302981.7, "Low Amount, (Less than $302981.70)",
                                            ifelse(Prize.Money < 1591534.75, "Medium Amount",
                                            ifelse(Prize.Money < 8548203.5, "High Amount, (More than $1,591,534.75)", 
                                                   "Very High Amount (More than $8,548,203.50)"))))) %>% 
    mutate(prize.money.classified = as.factor(prize.money.classified))

# Re-categorize price money classes
player_stats_classes$prize.money.classified = factor(player_stats_classes$prize.money.classified, levels = 
                                                      c("Really Low Amount (Less than $65411.00)",
                                                        "Low Amount, (Less than $302981.70)", 
                                                        "Medium Amount", 
                                                        "High Amount, (More than $1,591,534.75)",
                                                        "Very High Amount (More than $8,548,203.50)"))
# Plot final categorization
prize_class_barplot <- player_stats_classes %>% 
    ggplot(aes(x = prize.money.classified))  +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 70, hjust = 1))+
    labs(x = "Prize Money Classes", y = "Number of players")+
    ggtitle("Figure 2.1.2 Final categorization of prize money")

# Heads our dataframe        
head(player_stats_classes)
A tibble: 6 × 7
Age  Seasons  Current.Rank  Best.Rank  Prize.Money  Prize.Money.in.million  prize.money.classified
 32       14           178         44      3261567                3.261567  High Amount, (More than $1,591,534.75)
 21        2           236        130       374093                0.374093  Medium Amount
 27       11           183         17      6091971                6.091971  High Amount, (More than $1,591,534.75)
 22        5            31         31      1517157                1.517157  Medium Amount
 28        1           307        213       278709                0.278709  Low Amount, (Less than $302981.70)
 21        1           232        229        59123                0.059123  Really Low Amount (Less than $65411.00)

Figure 2.1.2 shows the distribution of players across the new percentile brackets. The classes are now much more evenly sized.

In [13]:
options(repr.plot.width = 6, repr.plot.height = 5)
prize_class_barplot

2.2 Pre-process - Oversampling for Class Imbalance

For classification, we need to rebalance the data, as some classes, such as the 0% - 10% and 90% - 100% brackets, contain considerably fewer observations than the others. Thus, we oversampled the data with upSample() to ensure equal voting power for all classes.
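In spirit, upSample() resamples each class with replacement until all classes match the largest one; here is a hedged base-R sketch of that idea on toy labels:

```r
# Toy class labels with a 2/5/3 imbalance (hypothetical)
set.seed(1)
labels <- factor(c(rep("low", 2), rep("medium", 5), rep("high", 3)))
max_n <- max(table(labels))  # size of the largest class (5)
# For each class, sample its row indices with replacement up to max_n
balanced <- unlist(lapply(split(seq_along(labels), labels),
                          function(idx) sample(idx, max_n, replace = TRUE)))
table(labels[balanced])  # every class now has 5 observations
```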

In [14]:
player_stats_oversampled <- upSample(x = select(player_stats_classes, Seasons, Current.Rank, Age, Prize.Money, prize.money.classified), 
                                y = select(player_stats_classes, prize.money.classified) %>% unlist())
                       
glimpse(player_stats_oversampled)
Observations: 620
Variables: 6
$ Seasons                <dbl> 1, 1, 1, 1, 1, 1, 3, 4, 1, 1, 1, 2, 1, 1, 5, 1…
$ Current.Rank           <dbl> 232, 448, 228, 432, 374, 467, 494, 323, 370, 4…
$ Age                    <dbl> 21, 26, 21, 22, 26, 23, 23, 23, 19, 20, 25, 21…
$ Prize.Money            <dbl> 59123, 32892, 43346, 62529, 60865, 29228, 3347…
$ prize.money.classified <fct> Really Low Amount (Less than $65411.00), Reall…
$ Class                  <fct> Really Low Amount (Less than $65411.00), Reall…
In [15]:
options(repr.plot.width = 9, repr.plot.height = 5)

prize_oversample_barplot <- player_stats_oversampled %>% 
    ggplot(aes(x = prize.money.classified))  +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 55, hjust = 1))+
    labs(x = "Prize Money Classes", y = "Number of players")+
    ggtitle("Figure 2.2 Distribution after oversampling")

prize_oversample_barplot

After oversampling, the number of observations in each class is equal, as shown in Figure 2.2 above.

2.3 Classification analysis

With all the pre-processing completed, we are now ready to build our classifier.

2.3.1 Creating the training and test set split

To start, we split the data into training and test set, and put the testing data aside.

In [16]:
# Creating Training and Test Sets
set.seed(2020)
training_rows  <- player_stats_oversampled %>% 
    select(prize.money.classified) %>% 
    unlist() %>% 
    createDataPartition(p=0.75,list=FALSE)

X_train <- player_stats_oversampled  %>% 
    select(Age,Seasons,Current.Rank) %>% 
    slice(training_rows)  %>% 
    data.frame()

Y_train <- player_stats_oversampled  %>% 
    select(prize.money.classified) %>% 
    slice(training_rows) %>% 
    pull()

X_test<- player_stats_oversampled  %>% 
    select(Age,Seasons,Current.Rank) %>% 
    slice(-training_rows)  %>% 
    data.frame()

Y_test  <- player_stats_oversampled  %>% 
    select(prize.money.classified) %>% 
    slice(-training_rows)  %>% 
    pull()

2.3.2 Standardization

In this section, we standardize and scale the columns we will use as predictors. The values in the columns Current.Rank and Age have considerably different ranges, as rankings range from one to several hundred, while age is roughly limited to between 20 and 40 years. To ensure that both parameters have an equal influence on the predictions, we used a scale_transformer to standardize the data.
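Centre-and-scale standardization amounts to subtracting each column's training mean and dividing by its training standard deviation; a minimal base-R sketch with made-up values:

```r
# Hypothetical training predictors on very different scales
train <- data.frame(Age = c(21, 27, 32), Current.Rank = c(236, 183, 178))
means <- sapply(train, mean)
sds   <- sapply(train, sd)
# Subtract column means, then divide by column standard deviations
train_scaled <- sweep(sweep(train, 2, means), 2, sds, "/")
colMeans(train_scaled)    # ~0 for each column
sapply(train_scaled, sd)  # 1 for each column
```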

In [17]:
# Scaling Training and Test Set
set.seed(2020)
scale_transformer <- preProcess(X_train, method = c("center", "scale")) 
X_train_scaled <- predict(scale_transformer, X_train)
X_test_scaled <- predict(scale_transformer, X_test)

head(X_train_scaled)
head(X_test_scaled)
A data.frame: 6 × 3
       Age     Seasons  Current.Rank
-1.1732666  -1.0447524     0.1601854
-1.1732666  -1.0447524     0.1329385
-0.9766651  -1.0447524     1.5225303
-0.1902594  -1.0447524     1.1274503
-0.7800637  -1.0447524     1.7609407
-0.7800637  -0.6999619     1.9448572

A data.frame: 6 × 3
       Age     Seasons  Current.Rank
-0.1902594  -1.0447524     1.6315179
-1.5664694  -1.0447524     1.1002034
-0.3868609  -1.0447524     1.5838358
-1.5664694  -1.0447524     0.2623613
-0.5834623  -0.5275666     1.9244220
-1.3698680  -0.8723572     1.4407896

2.3.3 Cross-validation

Next, we perform 10-fold cross-validation to select the best k value for our classifier. Cross-validation gives a less biased estimate of accuracy for each candidate k by splitting the training data into equal folds and averaging the accuracy across them; the k with the best average accuracy is then chosen for us.
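The fold assignment behind 10-fold cross-validation can be sketched in base R (a toy index example; caret handles this internally):

```r
# Randomly assign 100 observations to 10 folds of equal size;
# each fold is then held out once as the validation set.
set.seed(2020)
n <- 100
folds <- sample(rep(1:10, length.out = n))
table(folds)  # ten folds with 10 observations each
```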

In [18]:
# input multiple k values, the model will evaluate the best k value to use 
set.seed(2020)
ks = data.frame(k = seq(from = 1, to = 51, by = 2)) 

train_control  <- trainControl(method='cv',number = 10)
knn_model_cv_10  <- train(x=X_train_scaled,
                          y=Y_train,method='knn',
                          tuneGrid=ks ,
                          trControl=train_control)
In [19]:
# plotting k against model accuracy
set.seed(2020)
accuracies  <- knn_model_cv_10$results
accuracy_vs_k  <- ggplot(accuracies,aes(x=k,y=Accuracy))+
    geom_point()+
    geom_line()+
    labs(x='k value',y='Model Accuracy') +
    ggtitle("Figure 2.3.3 Model accuracy vs. k values")
accuracy_vs_k
In [20]:
set.seed(2020)
best_k  <- knn_model_cv_10$results %>% 
    filter(Accuracy == max(Accuracy)) %>%
    select(k) %>%
    pull() 

best_accuracy  <- knn_model_cv_10$results %>% 
    filter(Accuracy == max(Accuracy)) %>%
    select(Accuracy) %>%
    pull() 
    

best_k
best_accuracy
1
0.801585209168465

The choice of K

By graphing the accuracy results of our 10-fold cross-validation, we confirmed that accuracy is highest at k = 1 and drops immediately afterwards. Therefore, we chose k = 1 as the optimal value, concluding that the ~80% estimated accuracy gives us a precise enough model.
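At k = 1, the prediction rule is simply the class of the single nearest training point; a minimal sketch with hypothetical scaled predictors:

```r
# Toy scaled training data: two "low" earners near the origin, one "high"
train_x <- matrix(c(0, 0, 1, 1, 5, 5), ncol = 2, byrow = TRUE)
train_y <- c("low", "low", "high")
predict_1nn <- function(x) {
    # Euclidean distance from x to every training row
    d <- sqrt(rowSums((train_x - matrix(x, nrow(train_x), 2, byrow = TRUE))^2))
    train_y[which.min(d)]  # label of the nearest neighbour
}
predict_1nn(c(4, 4))  # "high"
```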

2.3.4 Retrain the classification model with the optimal K

In [21]:
# Re-train the final model with the selected K
set.seed(2020)
k = data.frame(k=best_k)


knn_model_best  <- train(x = X_train_scaled,
                         y = Y_train,
                         method = 'knn',
                         tuneGrid = k )
print(knn_model_best)
k-Nearest Neighbors 

465 samples
  3 predictor
  5 classes: 'Really Low Amount (Less than $65411.00)', 'Low Amount, (Less than $302981.70)', 'Medium Amount', 'High Amount, (More than $1,591,534.75)', 'Very High Amount (More than $8,548,203.50)' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 465, 465, 465, 465, 465, 465, ... 
Resampling results:

  Accuracy   Kappa    
  0.7563822  0.6947914

Tuning parameter 'k' was held constant at a value of 1

2.3.5 Predict and evaluate the estimated accuracy on test data

In [22]:
# Predict on test set using retrained model
set.seed(2020)
Y_predicted  <- predict(object = knn_model_best, X_test_scaled)

# Comparing predicted classification with actual classification of the test data
model_quality  <- confusionMatrix(data = Y_predicted, reference = Y_test)
model_quality
Confusion Matrix and Statistics

                                            Reference
Prediction                                   Really Low Amount (Less than $65411.00)
  Really Low Amount (Less than $65411.00)                                         31
  Low Amount, (Less than $302981.70)                                               0
  Medium Amount                                                                    0
  High Amount, (More than $1,591,534.75)                                           0
  Very High Amount (More than $8,548,203.50)                                       0
                                            Reference
Prediction                                   Low Amount, (Less than $302981.70)
  Really Low Amount (Less than $65411.00)                                     4
  Low Amount, (Less than $302981.70)                                         22
  Medium Amount                                                               5
  High Amount, (More than $1,591,534.75)                                      0
  Very High Amount (More than $8,548,203.50)                                  0
                                            Reference
Prediction                                   Medium Amount
  Really Low Amount (Less than $65411.00)                1
  Low Amount, (Less than $302981.70)                     5
  Medium Amount                                         20
  High Amount, (More than $1,591,534.75)                 5
  Very High Amount (More than $8,548,203.50)             0
                                            Reference
Prediction                                   High Amount, (More than $1,591,534.75)
  Really Low Amount (Less than $65411.00)                                         0
  Low Amount, (Less than $302981.70)                                              1
  Medium Amount                                                                   1
  High Amount, (More than $1,591,534.75)                                         28
  Very High Amount (More than $8,548,203.50)                                      1
                                            Reference
Prediction                                   Very High Amount (More than $8,548,203.50)
  Really Low Amount (Less than $65411.00)                                             0
  Low Amount, (Less than $302981.70)                                                  0
  Medium Amount                                                                       1
  High Amount, (More than $1,591,534.75)                                              1
  Very High Amount (More than $8,548,203.50)                                         29

Overall Statistics
                                          
               Accuracy : 0.8387          
                 95% CI : (0.7712, 0.8928)
    No Information Rate : 0.2             
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7984          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Really Low Amount (Less than $65411.00)
Sensitivity                                                  1.0000
Specificity                                                  0.9597
Pos Pred Value                                               0.8611
Neg Pred Value                                               1.0000
Prevalence                                                   0.2000
Detection Rate                                               0.2000
Detection Prevalence                                         0.2323
Balanced Accuracy                                            0.9798
                     Class: Low Amount, (Less than $302981.70)
Sensitivity                                             0.7097
Specificity                                             0.9516
Pos Pred Value                                          0.7857
Neg Pred Value                                          0.9291
Prevalence                                              0.2000
Detection Rate                                          0.1419
Detection Prevalence                                    0.1806
Balanced Accuracy                                       0.8306
                     Class: Medium Amount
Sensitivity                        0.6452
Specificity                        0.9435
Pos Pred Value                     0.7407
Neg Pred Value                     0.9141
Prevalence                         0.2000
Detection Rate                     0.1290
Detection Prevalence               0.1742
Balanced Accuracy                  0.7944
                     Class: High Amount, (More than $1,591,534.75)
Sensitivity                                                 0.9032
Specificity                                                 0.9516
Pos Pred Value                                              0.8235
Neg Pred Value                                              0.9752
Prevalence                                                  0.2000
Detection Rate                                              0.1806
Detection Prevalence                                        0.2194
Balanced Accuracy                                           0.9274
                     Class: Very High Amount (More than $8,548,203.50)
Sensitivity                                                     0.9355
Specificity                                                     0.9919
Pos Pred Value                                                  0.9667
Neg Pred Value                                                  0.9840
Prevalence                                                      0.2000
Detection Rate                                                  0.1871
Detection Prevalence                                            0.1935
Balanced Accuracy                                               0.9637
In [23]:
model_quality$overall[1]
Accuracy: 0.838709677419355

With the retrained final classifier, our model predicted the test data set with an accuracy of approximately 84%.

This prediction accuracy validated our decision to keep three predictors instead of reducing to two. We tried models with different pairs of predictors, but the prediction accuracy with all three variables was notably higher than any pair. Intuitively, using all three variables also makes the most sense: the amount of prize money a player earns does not depend on one or two variables alone, so many factors should inform the prediction.
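The comparison of predictor subsets described above can be sketched as follows. This is a minimal sketch rather than the exact code we ran: it assumes the caret workflow from earlier sections and reuses `X_train_scaled` / `Y_train` from above, while `X_test_scaled` and `Y_test` are illustrative names for the scaled test split.

```r
# Sketch: compare k-nn accuracy across predictor subsets (hypothetical;
# X_test_scaled and Y_test are assumed names for the scaled test split).
subsets <- list(c("Age", "Current.Rank"),
                c("Age", "Seasons"),
                c("Current.Rank", "Seasons"),
                c("Age", "Current.Rank", "Seasons"))

accuracies <- sapply(subsets, function(vars) {
    # refit the classifier using only the chosen predictors
    fit <- train(x = X_train_scaled[, vars], y = Y_train,
                 method = "knn",
                 trControl = trainControl(method = "cv", number = 5))
    pred <- predict(fit, X_test_scaled[, vars])
    confusionMatrix(pred, Y_test)$overall["Accuracy"]
})
```

Comparing the resulting accuracies side by side makes the "three predictors beat any pair" claim easy to verify.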

Based on the results of the confusionMatrix, our model achieved perfect sensitivity on the Really Low Amount class: every player actually in that class was classified correctly. In contrast, the model made a few prediction errors in the other classes (Low Amount, Medium Amount, High Amount, and Very High Amount).
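The per-class figures quoted here can be pulled directly from the confusionMatrix object rather than read off the printed output; caret stores them in its `byClass` matrix:

```r
# Per-class statistics are stored in the byClass matrix of the
# confusionMatrix object (model_quality, computed above). A sensitivity
# of 1.0 in the first row corresponds to the Really Low Amount class
# having none of its true members misclassified.
model_quality$byClass[, c("Sensitivity", "Pos Pred Value", "Balanced Accuracy")]
```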

2.4 Visualization of the classification analysis

In [24]:
options(repr.plot.width = 8, repr.plot.height = 5)

# Recombine the scaled predictors with their prize-money class labels
training_oversampled_scaled <- bind_cols(X_train_scaled, data.frame(price_class = Y_train))

# 3D scatterplot: seasons played vs. age vs. current rank, coloured by prize class
prize_class_seasons_vs_age <- plot_ly(training_oversampled_scaled,
                                      x = ~Seasons, y = ~Age, z = ~Current.Rank,
                                      color = ~price_class, size = 5, opacity = 0.8) %>%
    add_markers() %>%
    layout(scene = list(xaxis = list(title = 'Seasons Played'),
                        yaxis = list(title = 'Age'),
                        zaxis = list(title = 'Current Rank')))

prize_class_seasons_vs_age

In order to visualize the classification analysis, we created a 3-dimensional scatterplot using x = Seasons, y = Age, and z = Current.Rank with the colour of each observation being its prize money class. The plot shows that the players who have won the most prize money are generally those who are older in age, have played more seasons, and have a better current ranking.

Discussion

1. Summary of findings

In conclusion, we found that professional tennis players who are older, more experienced, and have a better current ranking are generally the ones who have earned the most prize money throughout their careers. That said, the correlation between each of these three variables and prize money is not strong, which suggests that no single variable on its own is a good predictor of a tennis player's prize money. However, plotting all three variables together on a 3-dimensional scatterplot shows a general trend in which players have earned the most. Therefore, this k-nn classification model could be used to predict how much prize money a male tennis player could earn based on his age, current ranking, and number of seasons played.
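As a sketch of how such a prediction might look in practice (the object names `knn_fit` for the retrained classifier and `scaling_params` for a caret preProcess object are hypothetical, not objects defined in this report):

```r
# Hypothetical example: predict the prize-money class of a new player.
# `knn_fit` (the retrained caret model) and `scaling_params` (the
# preProcess object used to centre/scale the training data) are
# illustrative names only.
new_player <- data.frame(Age = 24, Current.Rank = 35, Seasons = 6)
new_player_scaled <- predict(scaling_params, new_player)  # apply the training-set scaling
predict(knn_fit, new_player_scaled)                       # predicted prize-money class
```

Note that any new observation must be scaled with the same parameters as the training data before being passed to the classifier.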

2. Comparison to expectations

Before conducting this study, we expected that players who were older, higher ranked, and more experienced would have won more prize money throughout their careers because it intuitively makes sense. Naturally, athletes who have played professionally for a long time and are ranked higher within the league are the ones who have won more tournaments, thus also having won more prize money.

Additionally, we initially expected each individual predictor to have a stronger linear correlation with prize money. However, the exploratory analysis shows that, at best, they are moderately correlated. This may be due to the extreme outliers that left our prize money distribution heavily skewed. As mentioned in the introduction, the three top-ranked players have dominated men's tennis for the past decade and won almost all major tournaments. These outliers affected the summary statistics of the dataset and the linear correlations between each variable and prize money, and therefore the accuracy of our model as well.

3. Potential impacts

These findings can be significant in helping people plan their lives accordingly. Using this model, male tennis players could gain insight into how much money they might make on this career path. Similarly, parents of aspiring players could get an estimate of which factors to consider if their child wants to be successful in this field, giving them more insight into how they can help their child plan for the future.

Additionally, questions involving income often raise concerns regarding equality as well. With that in mind, our model can provide more insight into whether any form of inequality/discrimination is a problem in this industry. For example, suppose our model predicts that a player should be making a lot more than they actually are. In that case, further investigation may be encouraged to determine whether this discrepancy is caused by variables such as nationality, race, etc. In order to improve our world, we need to address equality issues and consider them while analyzing statistics.

4. Future questions

There are limitations to our findings that should be noted and addressed in future experiments. In particular, 3D data cannot be read as accurately as 2D data because perspective distorts distances, so the data's readability decreased when we plotted it on the 3D scatterplot. In future studies, it may be better to use different visualization techniques, even ones outside the scope of the DSCI 100 course material. This data set and its analysis also open the field to new questions for future studies, such as:

“How would removing extreme outliers affect the analysis and results of this prediction model?”

“What is the ideal age for children to start playing tennis if they want to become professional players?”

“Which countries have won the most titles and/or have better programs for developing tennis skills in children?”

These questions should be analyzed separately; however, the same tidy data and overall methods can be reused. For the next experiment, it would be best to minimize the limitations of this study; no data set or experiment is perfect, but there is always room to improve.
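One alternative visualization that avoids the 3D-perspective problem while staying within ggplot2 (already loaded via the tidyverse) is to facet 2D scatterplots by prize class; a sketch using the scaled training data frame built in section 2.4:

```r
# Sketch: facet 2D scatterplots by prize-money class instead of one 3D plot.
# Uses training_oversampled_scaled and its price_class column as
# constructed earlier in this report.
training_oversampled_scaled %>%
    ggplot(aes(x = Seasons, y = Current.Rank, colour = Age)) +
    geom_point(alpha = 0.7) +
    facet_wrap(~price_class) +
    labs(x = "Seasons Played", y = "Current Rank", colour = "Age (scaled)")
```

Each panel then shows the seasons-vs-rank pattern for one prize class, with age carried by colour rather than a third spatial axis.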

References

Gay, J. (2019, August 27). Telling the truth about tennis: Noah Rubin's Instagram series 'Behind the Racquet' chronicles the humble real world outside the top 10. Wall Street Journal. Retrieved from http://ezproxy.library.ubc.ca/login?url=https://search-proquest-com.ezproxy.library.ubc.ca/docview/2280317831?accountid=14656

Timbers, T., Campbell, T., & Lee, M. (2020). Introduction to Data Science [Ebook]. https://ubc-dsci.github.io/introduction-to-datascience/

Wertheim, J. (2019). Holding Court. Sports Illustrated. 130(16), 22-22.
